Correcting Sampling Bias in Structural Genomics through Iterative Selection of Underrepresented Targets
Abstract
In this study we propose an iterative procedure for correcting sampling bias in labeled datasets used for supervised learning. Given a much larger, unbiased unlabeled dataset, the approach trains contrast classifiers to iteratively select the unlabeled examples that are most underrepresented in the labeled dataset. Once labeled, these examples could greatly reduce the sampling bias in the labeled dataset. Unlike active learning methods, the procedure does not need the actual labels to determine the most appropriate sampling schedule. We applied the procedure to an important bioinformatics problem: prioritizing protein targets for structural genomics projects. We show that the procedure can identify protein targets that are underrepresented in the current protein structure database, the Protein Data Bank (PDB). We argue that these proteins should be given higher priority for experimental structural characterization, so that sampling bias in the current PDB is reduced faster and the database becomes more representative of the protein space.
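The selection loop described in the abstract can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the authors' implementation: the abstract does not say which classifier or features were used, so a one-dimensional weighted logistic regression stands in for the contrast classifier, the data are synthetic numbers rather than protein sequences, and the names `train_contrast`, `select_underrepresented`, `rounds`, and `batch` are hypothetical.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_contrast(labeled, unlabeled, epochs=200, lr=1.0):
    """Fit a 1-D logistic model (assumed stand-in for a contrast classifier).

    The output is high for points that look more like the unlabeled pool
    than like the labeled set.
    """
    w, b = 0.0, 0.0
    # Weight each pool to total 0.5 so the larger pool does not dominate.
    data = [(x, 0.0, 0.5 / len(labeled)) for x in labeled]
    data += [(x, 1.0, 0.5 / len(unlabeled)) for x in unlabeled]
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y, weight in data:
            err = weight * (sigmoid(w * x + b) - y)
            grad_w += err * x
            grad_b += err
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


def select_underrepresented(labeled, unlabeled, rounds=5, batch=20):
    """Return indices into `unlabeled`, in the order they were selected."""
    augmented = list(labeled)
    candidates = set(range(len(unlabeled)))
    selected = []
    for _ in range(rounds):
        # Retrain against the full unlabeled pool, which stays the reference.
        w, b = train_contrast(augmented, unlabeled)
        ranked = sorted(candidates,
                        key=lambda i: sigmoid(w * unlabeled[i] + b),
                        reverse=True)
        chosen = ranked[:batch]
        selected.extend(chosen)
        candidates.difference_update(chosen)
        # Only the features are used; no label is ever consulted.
        augmented.extend(unlabeled[i] for i in chosen)
    return selected


# Synthetic demo: the labeled set covers only the lower half of the range.
rng = random.Random(0)
unlabeled = [rng.random() for _ in range(500)]      # unbiased pool on [0, 1)
labeled = [0.5 * rng.random() for _ in range(100)]  # biased toward [0, 0.5)
picked = select_underrepresented(labeled, unlabeled)
```

In this toy setting the labeled set plays the role of proteins already in the PDB and the unlabeled pool plays the role of the wider protein space. The selected indices form a sampling schedule that is computed without ever looking at a label, which is the property the abstract contrasts with active learning.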
Similar resources
Removing GPS collar bias in habitat selection studies
1. Compared to traditional radio-collars, global positioning system (GPS) collars provide finer spatial resolution and collect locations across a broader range of spatial and temporal conditions. However, data from GPS collars are biased because vegetation and terrain interfere with the satellite signals necessary to acquire a location. Analyses of habitat selection generally proceed without co...
Correcting sample selection bias in maximum entropy density estimation
We study the problem of maximum entropy density estimation in the presence of known sample selection bias. We propose three bias correction approaches. The first one takes advantage of unbiased sufficient statistics which can be obtained from biased samples. The second one estimates the biased distribution and then factors the bias out. The third one approximates the second by only using sample...
Detection of Underrepresented Biological Sequences using Class-Conditional Distribution Models
A labeled sequence data set related to a certain biological property is often biased and, therefore, does not completely capture its diversity in nature. To reduce this sampling bias problem a data mining procedure is proposed for detecting underrepresented relevant sequences. The procedure is aimed at helping domain experts achieve a cost-effective qualitative enlargement of knowledge through ...
The Pattern of Structural Relationships of Relapse of Individuals with Substance Use Disorder based on Attentional Bias and Reward Sensitivity with the Mediating Role of Inhibition Control
Objective: The aim of this study was to investigate the pattern of structural relationships of relapse in individuals with substance use disorder based on attentional bias and reward sensitivity with the mediating role of inhibition control. Method: The present study was a descriptive-correlational study using structural equation modeling. The statistical population of this study included all withdrawi...
Correcting the Site Frequency Spectrum for Divergence-Based Ascertainment
Comparative genomics based on sequenced referenced genomes is essential to hypothesis generation and testing within population genetics. However, selection of candidate regions for further study on the basis of elevated or depressed divergence between species leads to a divergence-based ascertainment bias in the site frequency spectrum within selected candidate loci. Here, a method to correct t...
Publication date: 2005